{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Assignment 1: Data Preprocessing and Data Exploration\n",
    "\n",
    "This assignment contains two parts: 1) data preprocessing and 2) data exploration and analysis. The objectives of this assignment are as follows:\n",
    "\n",
    "- Handle missing values\n",
    "- Correct data format\n",
    "- Standardize and normalize data\n",
    "- Explore features or characteristics to predict the price of a car\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Dataset description</h2>\n",
    "\n",
    "The dataset you are asked to analyze describes cars from 1985. The dataset is available from the following link: <a href=\"https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data\">https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data</a>. This data set consists of three types of entities:\n",
    "- the specification of an auto in terms of various characteristics,\n",
    "- its assigned insurance risk rating,\n",
    "- its normalized losses in use as compared to other cars. \n",
    "\n",
    "The insurance risk rating corresponds to the degree to which the auto is more risky than its price indicates. Each car is initially assigned a risk factor symbol associated with its price. Then, if the car is more (or less) risky, this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process \"symboling\". A value of +3 indicates that the auto is risky, and -3 that it is probably pretty safe. The third factor is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/specialty, etc…), and represents the average loss per car per year."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 1: Data preprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>1. Import data</h3>\n",
    "\n",
    "<b>Import and visualize the dataset, which is in CSV format.</b>\n",
    "\n",
    "The dataset does not contain headers (the column names are missing). You can use the following to add the column names in your imported data:\n",
    "\n",
    "`headers = [\"symboling\",\"normalized-losses\",\"make\",\"fuel-type\",\"aspiration\", \"num-of-doors\",\"body-style\", \"drive-wheels\",\"engine-location\",\"wheel-base\", \"length\",\"width\",\"height\",\"curb-weight\",\"engine-type\", \"num-of-cylinders\", \"engine-size\",\"fuel-system\",\"bore\",\"stroke\",\"compression-ratio\",\"horsepower\", \"peak-rpm\",\"city-mpg\",\"highway-mpg\",\"price\"]`\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# question 1 answer\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>2. Write code to print the following information:</h3> \n",
    "<br/>\n",
    "<b>\n",
    "<ol>\n",
    "    <li> Number of instances </li>\n",
    "    <li> Number of attributes (columns) </li>\n",
    "    <li> Name and type of each attribute </li>\n",
    "</ol>\n",
    "    </b>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "# question 2 answer\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3> 3. Handling missing values</h3>\n",
    "\n",
    "Two observations can be made by looking at the outputs from question 1 and question 2:\n",
    "\n",
    "- several question marks appear in the dataframe from question 1; these are missing values.\n",
    "- many attributes have the data type `object`. However, judging by the output of question 1, these attributes appear to be numerical.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h4> 3.1. Identify missing values </h4>\n",
    "\n",
    "<br/>\n",
    "<b>\n",
    "<ol>\n",
    "    <li> Write code that converts \"?\" to NaN</li>\n",
    "    <li> Write code that counts the number of missing values (i.e. NaN) in each column. You should print each column name with its number of missing values (print 0 if there are no missing values)</li>\n",
    "</ol>\n",
    "</b>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# question 3.1 answer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h4> 3.2. Dealing with missing data </h4>\n",
    "\n",
    "There are different approaches to dealing with missing data. Below are some examples:\n",
    "\n",
    "<ol>\n",
    "    <li>drop data<br>\n",
    "        a. drop the whole row<br>\n",
    "        b. drop the whole column\n",
    "    </li>\n",
    "    <li>replace data<br>\n",
    "        a. replace it by mean<br>\n",
    "        b. replace it by most frequent value<br>\n",
    "        c. replace it based on other functions\n",
    "    </li>\n",
    "</ol>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<b>\n",
    "<ol>\n",
    "    <li>For each attribute, explain which of the above approaches is best suited (add any code/visualization that helps you in making a decision)</li>\n",
    "    <li>For each attribute, apply the approach you selected in the above question to deal with missing values</li>\n",
    "    </ol>\n",
    "</b>    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "#question 3.2. answer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>4. Correct data format</h3>\n",
    "\n",
    "The last step in data cleaning is checking and making sure that all data is in the correct format (int, float, text or other).\n",
    "\n",
    "In question 2, we saw that some columns are not of the correct data type. Numerical variables should have type 'float' or 'int', and variables with strings such as categories should have type 'object'. For example, the 'bore' and 'stroke' variables are numerical values that describe the engines, so we should expect them to be of type 'float' or 'int'.\n",
    "<br/>\n",
    "<b>\n",
    "<ol>\n",
    "    <li> Write code that converts each column to its proper data type</li>\n",
    "    <li> List the columns and their types to make sure your conversion is correct</li>\n",
    "</ol>\n",
    "</b>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "#question 4 answer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>5. Data Standardization</h3>\n",
    "\n",
    "<p>Standardization is the process of transforming data into a common format, which allows the researcher to make meaningful comparisons.</p>\n",
    "\n",
    "<p>In our dataset, the fuel consumption columns \"city-mpg\" and \"highway-mpg\" are expressed in miles per gallon (mpg). Assume we are developing an application in a country that reports fuel consumption in L/100km.</p>\n",
    "<br/>\n",
    "<b>\n",
    "<ol>\n",
    "    <li> Write code that converts the \"highway-mpg\" column from mpg to L/100km, and rename the column to \"highway-L/100km\".</li>\n",
    "</ol>\n",
    "</b>\n",
    "\n",
    "The formula for unit conversion is: L/100km = 235 / mpg"
   ]
  },
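  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch of where the constant 235 comes from, assume US gallons (about 3.785 L) and statute miles (about 1.609 km). The assignment does not state which gallon is meant, so this is an assumption:\n",
    "\n",
    "$$\\text{L/100km} = \\frac{100 \\times 3.785}{1.609 \\times \\text{mpg}} \\approx \\frac{235.2}{\\text{mpg}}$$\n",
    "\n",
    "For example, 30 mpg corresponds to roughly 235 / 30 ≈ 7.8 L/100km. Note that the relationship is inverse: higher mpg values map to lower L/100km values."
   ]
  },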
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "#question 5 answer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>6. Data Normalization</h3>\n",
    "\n",
    "<p>Normalization is the process of transforming the values of several variables into a similar range. Typical normalizations include scaling the variable so its average is 0, scaling the variable so its variance is 1, or scaling the variable so its values range from 0 to 1.\n",
    "</p>\n",
    "<br/>\n",
    "<b>\n",
    "<ol>\n",
    "    <li> Normalize the columns \"length\", \"width\" and \"height\" so that their values fall in the range [0, 1] </li>\n",
    "</ol>\n",
    "</b>"
   ]
  },
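  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The scalings described above are commonly written as follows. This is a sketch; the symbols $x$ (an original value), $\\bar{x}$ (column mean), $s$ (column standard deviation), $x_{\\min}$ and $x_{\\max}$ are notation introduced here for illustration:\n",
    "\n",
    "$$x' = x - \\bar{x} \\qquad \\text{(average becomes 0)}$$\n",
    "\n",
    "$$x' = \\frac{x - \\bar{x}}{s} \\qquad \\text{(average becomes 0 and variance becomes 1)}$$\n",
    "\n",
    "$$x' = \\frac{x - x_{\\min}}{x_{\\max} - x_{\\min}} \\qquad \\text{(values range from 0 to 1)}$$\n",
    "\n",
    "The last form, often called min-max scaling, maps the smallest value in the column to 0 and the largest to 1."
   ]
  },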
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# question 6 answer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>7. Binning</h3>\n",
    "\n",
    "<p>\n",
    "    Binning is a process of transforming continuous numerical variables into discrete categorical 'bins', for grouped analysis.\n",
    "</p>\n",
    "\n",
    "<p>In our dataset, \"horsepower\" is a real-valued variable. In our analysis, we only care about the price difference between cars with high horsepower, medium horsepower, and low horsepower (3 types).</p>\n",
    "<br/>\n",
    "<b>\n",
    "    <ol>\n",
    "        <li>Plot the histogram of horsepower, to see what the distribution of horsepower looks like.</li>\n",
    "        <li>Rearrange the values of 'horsepower' into 3 bins of equal width named 'Low', 'Medium', 'High' and place them in a new column named 'horsepower_categories'</li>\n",
    "        <li> Print the number of vehicles in each bin </li>\n",
    "        <li>Visualize the distribution of bins created above using a histogram </li>\n",
    "    </ol>\n",
    "    </b>"
   ]
  },
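  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One common way to build three bins that each span the same range of values can be sketched as follows. The symbols $h_{\\min}$ and $h_{\\max}$ (smallest and largest horsepower values) and $w$ (bin width) are notation introduced here for illustration:\n",
    "\n",
    "$$w = \\frac{h_{\\max} - h_{\\min}}{3}, \\qquad \\text{edges: } h_{\\min},\\; h_{\\min} + w,\\; h_{\\min} + 2w,\\; h_{\\max}$$\n",
    "\n",
    "With hypothetical values $h_{\\min} = 50$ and $h_{\\max} = 200$ (illustrative numbers, not taken from the dataset), $w = 50$ and the edges are 50, 100, 150 and 200. Bins built this way generally do not contain equal numbers of vehicles."
   ]
  },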
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "#question 7 answer\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 2: Explore features or characteristics to predict the price of a car"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>1. Analyzing Individual Feature Patterns using Visualization</h3>\n",
    "\n",
    "<p>When visualizing individual variables, it is important to first understand what type of variable you are dealing with. This will help us find the right visualization method for that variable.</p>\n",
    "\n",
    "<u>Continuous numerical variables:</u> \n",
    "\n",
    "<p>Continuous numerical variables are variables that may contain any value within some range. Continuous numerical variables can have the type \"int64\" or \"float64\". A good way to visualize these variables is with scatterplots with fitted lines (you can use <a href=\"https://seaborn.pydata.org/generated/seaborn.regplot.html\">regplot</a> from the seaborn package).</p>\n",
    "\n",
    "<u>Categorical variables</u>\n",
    "\n",
    "<p>These are variables that describe a 'characteristic' of a data unit, and are selected from a small group of categories. The categorical variables can have the type \"object\" or \"int64\". A good way to visualize categorical variables is by using <a href=\"https://seaborn.pydata.org/generated/seaborn.boxplot.html\">boxplots</a>.</p>\n",
    "<br/>\n",
    "\n",
    "<p>Answer the following questions to analyze which attributes can predict the price of a car</p>\n",
    "<br/>\n",
    "<b>\n",
    "<ol>\n",
    "    <li>Compute the Pearson Correlation between \"price\" and each of the following attributes: engine-size, highway-mpg, peak-rpm, stroke</li>\n",
    "    <li> Visualize the results using a heatmap</li>\n",
    "    <li>Visualize the relationship between \"price\" and each of the above attributes using the function \"regplot\"</li>\n",
    "    <li> Examine whether the correlation between \"price\" and each of the above variables is statistically significant (hint: you should compute and analyze the p-value)</li>\n",
    "    <li>Based on the above results, for each attribute, justify why it can or cannot be a good predictor of \"price\" </li>\n",
    "    <li>Visualize the relationship between \"price\" and each of the following attributes: body-style, engine-location, drive-wheels using boxplots. Based on the results, explain why each of these attributes can or cannot be a good predictor of \"price\"</li>\n",
    "    </ol>\n",
    "    </b>"
   ]
  },
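  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the Pearson correlation coefficient between two variables is usually defined as below. This is a sketch; the symbols $x_i$, $y_i$, $\\bar{x}$, $\\bar{y}$ and $n$ (paired observations, their means, and the number of pairs) are notation introduced here:\n",
    "\n",
    "$$r = \\frac{\\sum_{i=1}^{n} (x_i - \\bar{x})(y_i - \\bar{y})}{\\sqrt{\\sum_{i=1}^{n} (x_i - \\bar{x})^2}\\,\\sqrt{\\sum_{i=1}^{n} (y_i - \\bar{y})^2}}$$\n",
    "\n",
    "The value of $r$ lies between -1 and 1. Under the null hypothesis of no linear correlation, and assuming approximately normally distributed data, the p-value is typically obtained from the statistic\n",
    "\n",
    "$$t = r \\sqrt{\\frac{n - 2}{1 - r^2}}$$\n",
    "\n",
    "which follows a Student's t distribution with $n - 2$ degrees of freedom. A small p-value suggests that the observed correlation is unlikely to be due to chance alone."
   ]
  },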
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#question 1 answer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>2. Basics of Grouping</h3>\n",
    "\n",
    "<p>The \"groupby\" method groups data by different categories. The data is grouped based on one or several variables and analysis is performed on the individual groups.</p>\n",
    "<br/>\n",
    "<b>\n",
    "    <ol>\n",
    "        <li>Write code that shows, on average, which type of drive wheel makes a vehicle more expensive</li>\n",
    "        <li>Write code that shows, on average, which combination of drive wheel and body style makes a vehicle more expensive</li>\n",
    "    </ol>\n",
    "    </b>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# question 2 answer"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
